tmap and raster packageI’m a geographer by training with interest in (Geospatial) Data Science, Geocomputation and Data Visualization.
This document elaborates on the processing of NEXRAD hail data from the NOAA Severe Weather Data Inventory.
In this analysis, only data from the year 2015 is used.
I used this data to improve my skills in spatial data processing and visualization using mainly the tmap and raster package.
As the code was requested, I though it would be a good chance to get into RMarkdown!
If you find this interesting or helpful in any way you can follow me on my (Spatial) Data Science & Data Visualization journey (mainly R, Python, JavaScript) also on Twitter (@chvonmatt) or Github (@codicolus).
For this analysis and visualization I used the following datasets/data sources:
IMPORTANT: The hail data has to be downloaded manually (or via Kaggle)
+– data
|+—— nexrad-stations.txt
|+—— severeweatherdatainventory_2015.zip
||+———- hail-2015.csv
|+—— shapefiles
||+———- s_11au16.zip
|||+————– s_11au16.shp
|||+————– s_11au16.dbf
|||+————– s_11au16.prj
|||+————– s_11au16.shx
+– scripts
|+—— analysing_haildata_using_tmap.Rmd
+– output
|+—— figures
First, we have to set-up our R-environment. Here, this includes mainly loading our required libraries.
The tidyverse and lubridate packages are mainly used for the data manipulation. The sf package is used for spatial data reading and manipulation. Data visualizations are conducted using the tmap package (and later also the ggplot2 package as alternative approach).
# deactivating scientific notation
options(scipen=999)
# libraries
libs <- c("tidyverse", "lubridate", "sf", "tmap", "raster", "janitor", "rgdal", "rnaturalearth")
# check if libraries are available
pkgs_available <- libs %in% installed.packages()
if(!all(pkgs_available)){
print("Following packages need to be installed first:")
print(libs[!pkgs_available])
}
# load packages
invisible(lapply(libs, library, character.only=TRUE))
In this section we load both the NEXRAD hail data and also the US-States shapefile (see Section Data Sources).
The downloaded files are zipped. Thus, if not already done, we first extract/unzip the data.
From the output we can see that the US-States shapefile is a MULTIPOLYGON and is using the North American Datum (NAD83).
# Hail data - Severe Weather Data Inventory
# unzip if necessary + read-in
if(!file.exists("../data/hail-2015.csv")){
unzip("./data/severeweatherdatainventory_2015.zip", exdir = "../data")
}
hail_data <- read_csv("../data/hail-2015.csv")
## Parsed with column specification:
## cols(
## X.ZTIME = col_double(),
## LON = col_double(),
## LAT = col_double(),
## WSR_ID = col_character(),
## CELL_ID = col_character(),
## RANGE = col_double(),
## AZIMUTH = col_double(),
## SEVPROB = col_double(),
## PROB = col_double(),
## MAXSIZE = col_double()
## )
# US-States Shapefile
# unzip if necessary
if(!file.exists("../data/shapefiles/s_11au16.shp")){
unzip("./data/shapefiles/s_11au16.zip", exdir = "../data/shapefiles")
}
states_sf <- st_read(dsn = "../data/shapefiles/s_11au16.shp")
## Reading layer `s_11au16' from data source `C:\Projekte\Informatics\Data_Science\R\Kaggle\hail-2015\data\shapefiles\s_11au16.shp' using driver `ESRI Shapefile'
## Simple feature collection with 57 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1473 ymin: -14.37376 xmax: 179.7785 ymax: 71.38961
## geographic CRS: NAD83
Now, we’ll tidy our data - this includes adjusting column names, creating additional variables to facilitate the subsequent data analysis and visualization and also filtering out invalid data.
First, the hail observation dataset gets some tidying. The probability of hail and probability of severe hail (variables prob and sevprob) are defined from 0-100%. Entries containing values lower than zero (-999) should thus be filtered out. In this analysis we want to be sure to only include observations where hail detection is pretty certain. Thus, we even take it one step further and include only data with a 100% hail probability.
Auxiliary variables are then derived mainly from the timestamp-variable (after renaming: ymdhms). We’ll later use them to aggregate the data.
# clean names
hail_data <- hail_data %>%
janitor::clean_names()
# filter plausible data / sort out invalid data
hail_data <- hail_data %>% filter(prob >= 100 & sevprob >= 0)
# Converting ymdhms to datetime-object (lubridate-package)
# Adding auxiliary variables
hail_data <- hail_data %>%
rename(ymdhms = x_ztime) %>%
mutate(ymdhms = ymd_hms(ymdhms),
year = year(ymdhms),
month = month(ymdhms),
day = day(ymdhms),
hour = hour(ymdhms),
minute = minute(ymdhms),
sec = second(ymdhms))
Next we’ll add a season variable. Months are categorized according to their seasonal belonging. December to February (DJF) are considered as Winter, March to May (MAM) as Spring and so on.
As I am not fan of multiple nested ifelse-statements, I used an alternative approach using coalesce. First, we determine the belonging to the individual seasons separately, but let them coalesce to the season-variable within the same pipe. As you see from the ifelse-statements, the individual columns need to be complementary. This means only one of the four individual seasons has a value, the others are NA’s.
# add a season variable
hail_data <- hail_data %>%
mutate(month_name = month(month, label = TRUE, abbr = FALSE, locale = "English")) %>%
mutate(winter = ifelse(month_name %in% c("December", "January", "February"), "Winter (DJF)", NA),
spring = ifelse(month_name %in% c("March", "April", "May"), "Spring (MAM)", NA),
summer = ifelse(month_name %in% c("June", "July", "August"), "Summer (JJA)", NA),
autumn = ifelse(month_name %in% c("September", "October", "November"), "Autumn (SON)", NA)) %>%
mutate(season = coalesce(winter, spring, summer, autumn))
Less manipulation is needed to prepare the US-States multipolygon shapefile.
We only rename some variables and restrict the data to the Contiguous USA (CONUS) region.
# rename
states_sf <- states_sf %>% rename(state=STATE, state_name = NAME)
# US-State Names
state_names <- tibble(as.data.frame(states_sf)[,1:2])
# Names of contiguous US-States
contig_us <- setdiff(states_sf$state_name, c("Alaska", "Hawaii", "Puerto Rico",
"American Samoa", "Virgin Islands",
"Northern Marianas", "Guam"))
# RESTRICTION TO CONTIGUOUS USA
states_sf <- states_sf %>% filter(state_name %in% contig_us)
The Coordinate Reference System (CRS) of the hail observations (point data) must be specified first. The coordinate representations are in the World Geodetic System (WGS84, EPSG: 4326) as declared by the NOAA Severe Weather Data Inventory. In a later step we’ll create an auxiliary raster to assign each hail observation to a specific raster cell (see Section Maximum hail size and hail days). In order to obtain a meaningful quantitative measure (e.g. to allow for a spatial comparison), we must project the hail observations into a suitable coordinate system/projection first. The main purpose is that each raster cell with a certain hail-related value represents an equally sized area. For the USA, a suitable coordinate projection is the US National Equal Area (EPSG: 2163). The units of this projection are meter (m), so we’ll later be able to determine the raster resolution in m (km respectively). We also project the US States multipolygon from the North American Datum 83 (NAD83, EPSG: 2163) to the US National Equal Area projection.
More on how to choose a suitable projection / CRS can be read in the very informative blog post Choosing the right map projection by Michael Corey.
# transform hail_data to sf-object
hail_sf <- st_as_sf(hail_data, coords = c("lon", "lat"))
# set initial WGS84
hail_sf <- st_set_crs(hail_sf, 4326)
# transform to US National Equal Area
hail_sf <- hail_sf %>% st_transform(st_crs(2163))
states_usnational <- states_sf %>% st_transform(st_crs(2163))
# add transformed coordinates as columns (additionally to geometry)
transformed_coords <- st_coordinates(hail_sf)
hail_sf <- hail_sf %>% add_column(lon_transf = transformed_coords[,1], lat_transf = transformed_coords[,2])
rm(transformed_coords)
Now we create the auxiliary raster to later rasterize the hail observations.
For this we use the boundary box of the US States sf-object (US National Equal Area projection) as guide. The extent is however slightly adjusted to allow for a more convenient specification of the grid cell resolution. As for the seq-function only positive values are valid, the coordinates must be shifted after creation (- abs(bb[1]/1000)). Here, bb[1] corresponds to the bounding-box xmin, bb[2] to ymin, correspondingly.
For this analysis, a raster resolution of 25x25km (25km2) is chosen.
# BOUNDING BOX US-States in US National Equal Area Projection
bb <- st_bbox(states_usnational)
# maximum extents in x,y directions (km)
(dim_x_orig <- (bb[3] - bb[1]) / 1000)
## xmax
## 4548.293
(dim_y_orig <- (bb[4] - bb[2]) / 1000)
## ymax
## 2849.328
# adjusted raster dimensions + resolution
dim_x <- 4550
dim_y <- 2850
res_km <- 25
# longitude / latitude value vectors
lon_vals <- seq(0, dim_x, res_km) - abs(bb[1] / 1000)
lat_vals <- seq(0, dim_y, res_km) - abs(bb[2] / 1000)
lon_vals <- lon_vals * 1000
lat_vals <- lat_vals * 1000
# Creating the auxiliary raster
aux_raster <- raster(matrix(NA, ncol = dim_x / res_km, nrow = dim_y / res_km))
# set projection
us_national_equalArea <- rgdal::make_EPSG() %>% filter(code == 2163)
projection(aux_raster) <- us_national_equalArea$prj4 # proj4-string
# set extent
extent(aux_raster) <- c(min(lon_vals), max(lon_vals), min(lat_vals), max(lat_vals))
# dimensions
aux_raster
## class : RasterLayer
## dimensions : 114, 182, 20748 (nrow, ncol, ncell)
## resolution : 25000, 25000 (x, y)
## extent : -2031905, 2518095, -2116951, 733048.7 (xmin, xmax, ymin, ymax)
## crs : +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
## source : memory
## names : layer
## values : NA, NA (min, max)
length(lat_vals)
## [1] 115
length(lon_vals)
## [1] 183
In this step, the hail data is restricted to the extent of the just created auxiliary raster.
We also create a little auxiliary function here. The auxiliary function determines the closest coordinates to the longitude/latitude value vectors. Only longitude/latitude values lower or equal than a certain point are considered, such that the hail observations are matched to the correct raster cell. This is because the longitudes/latitudes are bounding the raster-cells and thus have larger dimensions (dim + 1, see above!) than the raster itself. Hail observations which are closer to the upper boundary longitude/latitude would therefore get falsely matched with the subsequent raster cell and not the one they’re located in!
Analogously, the last value of the longitude/latitude vectors is excluded and no valid option as there are no more raster cells after the boundary longitude/latitude values. (Though, this should not be a problem as the data is already confined to the bounding box).
# Restricting data to generated raster extent
hail_sf <- hail_sf %>%
filter(lon_transf >= min(lon_vals) & lon_transf <= max(lon_vals)) %>%
filter(lat_transf >= min(lat_vals) & lat_transf <= max(lat_vals))
# closest coordinates to match datapoints to raster cell
get_closest_coords <- function(value, lon_lat){
#return(which.min(abs(lon_lat[which(lon_lat <= value)] - value)))
return(which.max(lon_lat[which(lon_lat <= value)]))
}
# determine the closest corresponding raster cell for each observation
coords <- st_coordinates(hail_sf)
lon_close <- as.data.frame(coords[,1])
lat_close <- as.data.frame(coords[,2])
lon_close <- apply(lon_close, 1, get_closest_coords, lon_vals[1:length(lon_vals)-1])
lat_close <- apply(lat_close, 1, get_closest_coords, lat_vals[1:length(lat_vals)-1])
hail_sf <- hail_sf %>% add_column(lon_close = lon_close, lat_close = lat_close)
hail_sf <- hail_sf %>% mutate(lon_usnat = lon_vals[lon_close], lat_usnat = lat_vals[lat_close])
A short illustration of what would happen if this would not be adjusted for in the auxiliary function is provided here. The misclassification does obviously happen for every cell and not only the last - which would cause an error anyways.
Figure 1: Misclassification of hail observations
There’s one step left before we get to visualize the data: Creating meaningful quantitative variables to visualize!
One variable which is of interest considering hail damage is the hail size. We want to analyze how big the largest hailstones in 2015 were and where they were detected. This variable can be achieved simultaneously to rasterizing the data with the rasterize-function by using the fun argument. Also, our auxiliary raster is now used as template for the maximum-hailsize raster (output).
The rasterize function takes two columns with longitudes and latitudes. Here, the added columns with the transformed coordinates (into US National Equal Area projection) are used. As sf objects always have a geometry column, we first have to set the geometry to NULL. We replace the hail_data object with this “new” data-frame.
The template raster should have the same specifications as desired output-raster.
We then provide the variable to rasterize - in our case the maximum hail sizes (maxsize). After the rasterization, every 25x25km raster cell should contain only the largest maximum hail sizes detected in the year 2015, thus we use fun = max. A more detailed reference to the rasterize function can be found here.
# drop geometry column
hail_data <- hail_sf %>% st_set_geometry(NULL)
# rasterize and aggregate hail sizes by maximum
max_size <- rasterize(hail_data[,c("lon_transf", "lat_transf")], aux_raster,
hail_data[,c("maxsize")], fun = max)
To calculate the hail days, days with an observation of 100%-hail probability are summed up for each raster cell. For this, we use some dplyr-magic!
First, we group the data with all variables we want to keep. The first variables in the grouping are the auxiliary date-related variables we created in Section Setup → Hail Observations. Further, we now take advantage of the previously determined raster cells, such that a specific day with a hail observation does only count ONCE for a specific raster cell. As we want every hail observation on one specific day to count only once, we do not sum all observations up but set the counts to one (n = 1).
In a second step, we sum all hail days for each raster cell. To do so, we group the data by the raster cells (coordinates).
Similar to the maximum hail size, we then rasterize the observations. A function is not needed, as we already have determined the raster cells. Using a grouped summary results in only one value for each raster cell (the sum of hail days).
# count hail days
hail_days <- hail_data %>%
group_by(month, day, lon_usnat, lat_usnat) %>%
summarize(n = 1) %>%
dplyr::select(-n) %>%
ungroup()
hail_days <- hail_days %>% group_by(lon_usnat, lat_usnat) %>% summarize(days = n()) %>% ungroup()
hail_days <- rasterize(hail_days[,1:2], aux_raster, hail_days[,3]) # function doesn't matter here
Yeppa! After all the hard work, it’s time for visualizing the results. For this analysis we make use of the tmap-package. Similar to ggplot2, tmap also makes use of the Grammar of Graphics (see Wickham, 2010).
Before we use tmap, one (aesthetical) preparation is missing: masking the max_size and hail_days raster to the US States multipolygon such that only values laying on land get displayed! For this purpose, the raster::mask() function suits our needs.
# masking maximum hail sizes + hail days
max_size_masked <- mask(x = max_size, mask = states_usnational)
hail_days_masked <- mask(x = hail_days, mask = states_usnational)
# naming the masked raster layer
names(max_size_masked) <- "Size (inches)"
names(hail_days_masked) <- "Hail Days in 2015"
For each visualization, the generated raster with the variables of interest and the US States multipolygon dataset is used. To visualize the raster with tmap, we also provide the bounding box (it’s however not necessary here). Further, we set alpha = 0 for the faces of the US States polygons. This is required as the multipolygon-layer is overlaid onto the raster-layer which would be invisible if the polygon faces weren’t transparent. The inner.margins = c(bottom, left, top, right) argument is used to add some space at the bottom and the top of the map. If you need more detailed explanations on how to use the tmap-package or if you want to get started with the tmap-package, you may check out the tmap-getstarted-vignette! Further, also the book Geocomputation in R by Robin Lovelace, Jakub Nowosad and Jannes Muenchow is an extremely useful resource for spatial mapping!
# mapping maximum hail sizes
map_sizes <- tm_shape(max_size_masked, bbox = bb) +
tm_raster(title = "Size (inches)") +
tm_legend(legend.position = c("left", "bottom"), scale=1.2, legend.height=0.4, legend.width=0.5) +
tm_shape(states_usnational) +
tm_polygons(alpha = 0, border.col = "darkgray", lwd = 1.2) +
tm_layout(title = "Radar estimated maximum hail sizes in 2015",
title.size = 1.2,
frame = FALSE,
title.position = c("left", "top"),
inner.margins = c(0.10, 0, 0.08, 0)) +
tm_credits("Plot by Christoph von Matt / @chvonmatt \nGithub: https://github.com/codicolus \nData: NOAA Severe Weather Data Inventory", align="left", position = c(0.2, 0), size = 0.6) +
tm_credits("Projection: US National Equal Area \nGrid resolution: 25x25km \n", align = "left",
position = c(0.57, 0), size = 0.6)
# save map
#tmap_save(map_sizes, "./output/figures/maxhailsizes_2015.png", dpi=300)
Figure 2: Maximum (radar estimated) hail sizes observed in the US in the year 2015.
# mapping hail days
map_haildays <- tm_shape(hail_days_masked, bbox = bb) +
tm_raster(title = "Number of Days") +
tm_legend(main.title = "Hail days in 2015 (contiguous USA)", legend.position = c("left", "bottom"),
scale=1.2, legend.height=0.4, legend.width=0.5) +
tm_shape(states_usnational) +
tm_polygons(alpha = 0, border.col = "darkgray", lwd = 1.2) +
tm_layout(title.size = 1.2,
frame = FALSE,
title.position = c("left", "top"),
inner.margins = c(0.10, 0, 0, 0)) +
tm_credits("Plot by Christoph von Matt / @chvonmatt \nGithub: https://github.com/codicolus \nData: NOAA Severe Weather Data Inventory", align="left", position = c(0.2, 0), size = 0.6) +
tm_credits("Projection: US National Equal Area \nGrid resolution: 25x25km \n", align = "left",
position = c(0.57, 0), size = 0.6)
# save map
#tmap_save(map_haildays, "./output/figures/haildays_2015.png", dpi=300)
Figure 3: Maximum (radar estimated) hail sizes observed in the US in the year 2015.
Although the main purpose of this work is to illustrate on how to analyze and visualize hail data from NOAA’s Severe Weather Data Inventory, I nevertheless add a short discussion of the results here.
The radar estimated maximum hail sizes for the year 2015 depict a pretty distinct spatial pattern: The largest hail sizes were detected over the Great Plains. Hail sizes up to 4 inches are mainly distributed across South Dakota, Nebraska, Colorado, Kansas and Texas. In the other states, such very large hailstones were only detected sporadically. Most observations were smaller hail sizes (up to ca. 2.5 inches). Comparatively smaller hail sizes, or few to no hail signatures, respectively, were detected over parts of the Great Basin, the Rocky Mountains and also some coastal areas.
Regions with few or no detected hail signatures are also reflected in the hail days map. The region where most hail days were registered in 2015 is again located over the Great Plains. An interesting difference is however, that there were also many hail days registered over Florida, a state which didn’t stand out regarding the radar estimated maximum hail sizes.
To discuss the results, we’ll also need to consider the locations of the radar sites used in this analysis. To map the NEXRAD radar sites, I had to cheat a little bit - I used QGIS 3 and Map-Tiles provided by the US Geological Survey (USGS).